ABSTRACT

The problem of mapping items to skills is gaining interest 
with the emergence of recent techniques that can use data 
for both defining this mapping, and for refining mappings 
given by experts. We investigate the problem of refining a
mapping given by an expert by combining the output of different
techniques. The combination is based on a partition
tree that combines the suggested refinements of three known 
techniques from the literature. Each technique is given as 
input a Q-matrix, that maps items to skills, and student test 
outcome data, and outputs a modified Q-matrix that consti- 
tutes suggested improvements. We test the accuracy of the 
partition tree combination techniques over both synthetic 
and real data. The results over synthetic data show a large
improvement over the best single technique, with an 86% error
reduction on average for four different Q-matrices. For real
data, the error reduction is 55%. In addition to the substan- 
tial error reduction, the partition tree refinements provide 
a much more stable performance than any single technique. 
These results suggest that the partition tree is a valuable 
refinement combination approach that can effectively take 
advantage of the complementarity of the Q-matrix refine- 
ment techniques. It brings the goal of using a data driven 
approach to refine the item to skill mapping closer to real 
applications, although challenges remain and are discussed. 

1. INTRODUCTION 

Defining which skills are involved in a task is non trivial. 
Whereas task outcome is observable, skills are not. This 
layer of opacity leaves a world of possibilities to define which 
skills are behind task performance, and no obvious evidence 
to know if the proposed definition is correct or not. Means to
provide such feedback would be highly valuable to teachers
and designers of learning environments, and we find numerous
efforts towards this end in the last few years.
They are reviewed in section 2.

We developed an approach that takes the output of a combination
of techniques to detect likely errors in the task to skills
mappings given by experts. We investigate the combination
of three data-driven techniques [3, 2, 7] based on a partition
tree algorithm that creates binary partitions. See also [6]
for a more detailed comparison of the performance of these
three techniques.

The performance of the partition tree approach is tested 
over synthetic and real data. But even in the case of real 
data, the approach to grow the partition tree trains on syn- 
thetic performance data generated from a set of Q-matrices 
that are similar to the Q-matrix to refine. This procedure is 
chosen because only synthetic data provides a large enough 
training set, and because it also provides ground truth la- 
belling of latent variables. 

In the rest of this text we use the term items to refer to ques- 
tions or tasks that can be part of a formative or summative 
assessment, or exercises within an e-learning environment. 
Skills can be the mastery of concepts, factual knowledge, or 
any ability involved in item outcome success. However, the 
models reviewed here assume a static student skills state, 
as opposed to the Knowledge Tracing model and its deriva- 
tives [11], for example, which rely on dynamic data. We 
return to this limitation in the Discussion. 

The different techniques to validate a Q-matrix are first de- 
scribed, followed by the description of the approach, the 
experiments, and the results. 

2. Q-MATRICES AND TECHNIQUES TO 
VALIDATE THEM FROM DATA 

A mapping of items to skills is termed a Q-matrix. An example
of an 11-item, 3-skill Q-matrix is given beside. It corresponds
to the Q-matrix labelled QM 1 in the results section below. In
this example, item 4 requires skill 1 only, whereas item 11
requires skills 1 and 2. If all specified skills are required
to succeed at the item, the Q-matrix is labeled conjunctive. If
any one of the required skills is sufficient for the item’s
success, then it is labeled disjunctive. The compensatory
version corresponds to the case where each required skill
increases the chances of success in some way. Conjunctive
Q-matrices are the most common and all matrices of the
experiments here are of this type.

Q-matrix QM-1

The conjunctive/disjunctive distinction is also referred to as 
AND/OR gates. Skills models such as DINA (Deterministic 
Input Noisy AND) and DINO (Deterministic Input Noisy 
Or) make reference to this AND/OR gates terminology. 

The DINA model [10] defines the probability of success to 
an item as a function of whether the skills required are mas- 
tered, and of two parameters, the slip and guess factors. 
Mastery is a binary value based on the conjunctive frame- 
work: if all required skills are mastered then the value is 1, 
else it is 0. Slip and guess parameters are values that gen- 
erally vary on a [0, 0.2] scale. The probability of success to 
an item j by a student i is thereby defined as: 

$$P(X_{ij} = 1 \mid \eta_{ij}) = (1 - s_j)^{\eta_{ij}} \, g_j^{1 - \eta_{ij}}$$

where $\eta_{ij}$ is 1 if student i masters all required skills of item j,
0 otherwise. $s_j$ and $g_j$ are the slip and guess factors.
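
As a minimal sketch of the DINA success probability described above, the following Python function computes it for one student and one item. The function and argument names are our own illustrative choices, not part of any library.

```python
def dina_success_probability(q_row, mastery, slip, guess):
    """P(X_ij = 1) under DINA for one item (q_row) and one student (mastery)."""
    # eta is 1 only if every skill required by the item is mastered
    # (conjunctive rule), and 0 otherwise.
    eta = int(all(m == 1 for q, m in zip(q_row, mastery) if q == 1))
    return (1 - slip) ** eta * guess ** (1 - eta)


# Hypothetical item requiring skills 1 and 2 out of 3, with slip = guess = 0.2.
print(dina_success_probability([1, 1, 0], [1, 1, 0], 0.2, 0.2))  # 0.8
print(dina_success_probability([1, 1, 0], [1, 0, 1], 0.2, 0.2))  # 0.2
```

A student who masters all required skills succeeds unless a slip occurs, whereas any other student succeeds only by guessing.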

Two techniques for Q-matrix validation surveyed here rely
on the DINA model, whereas the third one relies on a matrix
factorization technique called ALS (Alternating Least
Squares), or more precisely ALSC for the conjunctive version
of the technique. We briefly review each technique below.

2.1 Technique 1: MinRSS 

Chiu defines a method that minimizes the residual sum of
squares (RSS) between the real responses and the ideal responses
that follow from a given Q-matrix [2] under the
DINA model. The algorithm adjusts the Q-matrix by first
estimating the mastery of each student, then choosing the
item with the worst RSS over the data, and replacing its
q-vector with the one that has the lowest RSS. It iterates until
convergence. We refer to this technique as MinRSS.
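
The core replacement step can be illustrated as follows. This is a simplified sketch that assumes the student mastery profiles are already estimated and handles a single item. It is not Chiu's full iterative algorithm, and all names are hypothetical.

```python
from itertools import product


def ideal_response(q_vector, mastery):
    """Conjunctive ideal response: 1 if all required skills are mastered."""
    return int(all(m >= q for q, m in zip(q_vector, mastery)))


def rss(q_vector, profiles, responses):
    """Residual sum of squares between observed and ideal responses for one item."""
    return sum((x - ideal_response(q_vector, a)) ** 2
               for a, x in zip(profiles, responses))


def best_q_vector(profiles, responses, n_skills):
    """Return the non-zero q-vector with the lowest RSS for one item."""
    candidates = [q for q in product([0, 1], repeat=n_skills) if any(q)]
    return min(candidates, key=lambda q: rss(q, profiles, responses))


# Toy data: four students, two skills; only students mastering skill 1 succeed.
profiles = [(1, 1), (1, 0), (0, 1), (0, 0)]
responses = [1, 1, 0, 0]
print(best_q_vector(profiles, responses, 2))  # (1, 0)
```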

2.2 Technique 2: MaxDiff 

The method defined by de la Torre [3] searches for a Q- 
matrix that maximizes the difference in the probabilities of 
a correct response to an item between examinees who pos- 
sess all the skills required for a correct response to that item 
and examinees who do not. It also relies on the DINA model 
to determine item outcome probability, and on an EM algorithm
to estimate the slip and guess parameters. This probability
difference represents an item discrimination index: the
greater the difference between the probability of a correct
response given the skills required and the probability given
missing skills, the more discriminating the item is. As such,
we can consider that the method finds a Q-matrix that maximizes
item discrimination over all items. We refer to this
technique as MaxDiff.
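
The discrimination index can be sketched as below. For illustration only, we assume that mastery profiles are known and use raw success proportions in place of the EM-estimated DINA probabilities used by the actual method. The function name is hypothetical.

```python
def discrimination(q_vector, profiles, responses):
    """Success rate of students mastering all skills in q_vector minus
    the success rate of the remaining students, for one item."""
    masters, others = [], []
    for mastery, x in zip(profiles, responses):
        has_all = all(m >= q for q, m in zip(q_vector, mastery))
        (masters if has_all else others).append(x)
    if not masters or not others:
        return 0.0  # index undefined when one group is empty; treated as 0 here
    return sum(masters) / len(masters) - sum(others) / len(others)


# Toy data: four students, two skills; only students mastering skill 1 succeed.
profiles = [(1, 1), (1, 0), (0, 1), (0, 0)]
responses = [1, 1, 0, 0]
print(discrimination((1, 0), profiles, responses))  # 1.0
print(discrimination((0, 1), profiles, responses))  # 0.0
```

On this toy data the q-vector (1, 0) maximizes the index, which is the q-vector a discrimination-based search would retain for the item.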

2.3 Technique 3: Conjunctive alternate 
Least-Square Factorization (ALSC) 

The Conjunctive alternate Least-Square Factorization (ALSC)
method is defined in [7]. Contrary to the other two methods,
it does not rely on the DINA model as it has no slip
and guess parameters. ALSC decomposes the results matrix
$R_{m \times n}$ of m items by n students as the product of two
smaller matrices:

$$\neg R = Q \, \neg S \qquad (1)$$

where $\neg R$ is the negation of the results matrix (m items by
n students), Q is the m items by k skills Q-matrix, and $\neg S$ is
the negation of the mastery matrix of k skills by n students
(normalized for rows columns to sum to 1). By negation, we
mean the 0-values are transformed to 1, and non-0-values
to 0. Negation is necessary for a conjunctive Q-matrix.
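
To illustrate why negation yields the conjunctive semantics, the sketch below uses small hypothetical matrices and an unnormalized product. Each entry of the product of Q and the negated mastery matrix counts the required skills a student is missing, so a non-zero entry marks a failure.

```python
def negate(matrix):
    """0-values become 1, non-0-values become 0."""
    return [[1 if v == 0 else 0 for v in row] for row in matrix]


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


Q = [[1, 0],      # item 1 requires skill 1
     [1, 1],      # item 2 requires skills 1 and 2
     [0, 1]]      # item 3 requires skill 2
S = [[1, 1, 0],   # skill 1 mastered by students 1 and 2
     [1, 0, 0]]   # skill 2 mastered by student 1 only

missing = matmul(Q, negate(S))   # number of required skills each student lacks
R = negate(missing)              # ideal conjunctive results, items x students
print(missing)  # [[0, 0, 1], [0, 1, 2], [0, 1, 1]]
print(R)        # [[1, 1, 0], [1, 0, 0], [1, 0, 0]]
```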

The factorization consists of alternating between estimates
of S and Q until convergence. Starting with the initial
expert-defined Q-matrix, $Q_0$, a least-squares estimate of S is
obtained:

$$\neg S_0 = (Q_0^T Q_0)^{-1} \, Q_0^T \, \neg R \qquad (2)$$

Then, a new estimate of the Q-matrix, $Q_1$, is again obtained
by the least-squares estimate:

$$Q_1 = \neg R \, \neg S_0^T \, (\neg S_0 \, \neg S_0^T)^{-1} \qquad (3)$$

And so on until convergence. Alternating between equations
(2) and (3) yields progressive refinements of the matrices
$Q_i$ and $S_i$ that more closely approximate R in equation (1).
The final $Q_i$ is rounded to yield a binary matrix.

Note that $(Q_i^T Q_i)$ or $(\neg S_i \, \neg S_i^T)$ may not be invertible,
for example in the case where the matrix $Q_i$ is not
column full-rank, or the matrix $S_i$ is not row full-rank. This
is resolved by adding a very small Gaussian noise before
attempting the matrix inverse.

2.4 Other techniques 

We chose the three techniques described above as the can- 
didates to combine refinements that can potentially provide 
more accurate suggestions than any of the individual ones, 
but any other equivalent technique could also be combined in 
the same fashion instead of the three chosen ones here. Po- 
tential candidates could be, for example, a technique based 
on a Bayesian approach by DeCarlo et al. [5], and recent 
techniques that rely on time information [13, 12]. Yet another
recent approach relies on item text [8] to establish the
mapping of items to skills.

Although the results obtained through a combination of 
techniques may vary as a function of the specific techniques 
chosen, the general principle remains valid for all possible 
combinations. And there is no reason to believe that the par- 
ticular combination of the current study is better or worse 
than other potential combinations. 

2.5 General validation principle 

The general idea behind the validation of Q-matrices is to 
introduce a perturbation to a matrix and run a refinement 
technique that takes the perturbed matrix and test data 
as input, and outputs a set of refinements. In all, 8 cases 
can occur and they are listed in table 1. The 8 cases are 
a combination of the original cell value, perturbation, and 
value proposed (2x2x2). 

Table 1: Refinement outcomes

              Perturbation        Refinement
              Value     Value     Value
              before    after     proposed    Outcome

  Perturbed cell
  (1)         0         1         0           correct (TP)
  (2)         1         0         1           correct (TP)
  (3)         0         1         1           wrong (FN)
  (4)         1         0         0           wrong (FN)

  Non perturbed cell
  (5)         0         0         0           correct (TN)
  (6)         1         1         1           correct (TN)
  (7)         0         0         1           wrong (FP)
  (8)         1         1         0           wrong (FP)

The outcome of a proposed value from the refinement technique
is considered correct if it corresponds to the original
value before the perturbation, and incorrect otherwise. We
also refer to the signal detection terminology with respect to
perturbations to introduce further classification of the error
types:

• True Positives (TP): perturbed cell that was cor- 
rectly changed 

• True Negatives (TN): non perturbed cell left un- 
changed 

• False Positives (FP): non perturbed cell incorrectly 
changed 

• False Negatives (FN): perturbed cell left unchanged 
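
The eight cases of table 1 reduce to two tests, whether the cell was perturbed and whether the proposed value restores the original one. A minimal sketch, with a hypothetical function name:

```python
def refinement_outcome(before, after, proposed):
    """Classify one cell as TP, FN, TN or FP from its three binary values."""
    perturbed = before != after      # was this cell the one perturbed?
    correct = proposed == before     # does the proposal match the original value?
    if perturbed:
        return "TP" if correct else "FN"
    return "TN" if correct else "FP"


print(refinement_outcome(0, 1, 0))  # TP, case (1)
print(refinement_outcome(1, 0, 0))  # FN, case (4)
print(refinement_outcome(1, 1, 0))  # FP, case (8)
```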


3. COMBINING TECHNIQUES WITH A 
PARTITION TREE 

Each of the techniques described above uses a different
algorithm to provide a potentially improved Q-matrix. In
that respect, their respective outcomes may be complementary,
and their combined outcome can be more reliable than
any single one. This is the first hypothesis and objective
of our study. Furthermore, some algorithms are more effective
in general, but may not be the best performer in all
contexts. Identifying in which context an algorithm provides
the most reliable outcome is another objective of combining
these techniques. We will see that the first hypothesis
is confirmed in the results of the partition tree labeled (1)
and the second is also confirmed by the results of partition
tree (3).


3.1 Partitioning tree 

To implement the partition tree combination of the three
techniques, we use the rpart package [19].

The rpart package builds classification models that can be 
represented as binary trees. The tree is constructed in a 
top-down recursive divide and conquer approach. At each 
node in the tree, cases are split into two groups based on 
their attribute value. 


3.1.1 Tree building 

Attribute selection in rpart is done on the basis of the Gini
index. The Gini index [16] can be calculated as:

$$\mathrm{Gini}(D) = 1 - \sum_{j=1}^{n} p_j^2$$

where n is the number of classes and $p_j$ is the relative
frequency of class j in dataset D. If attribute A is chosen to
split dataset D into two subsets $D_1$ and $D_2$, then the
Gini index for attribute A is defined as:

$$\mathrm{Gini}_A(D) = \frac{|D_1|}{|D|} \, \mathrm{Gini}(D_1) + \frac{|D_2|}{|D|} \, \mathrm{Gini}(D_2)$$

Once we have the Gini index for all attributes, we can calculate
the reduction for each attribute:

$$\Delta\mathrm{Gini}(A) = \mathrm{Gini}(D) - \mathrm{Gini}_A(D)$$

The attribute that creates the largest reduction is chosen
as the splitting point in the decision tree.
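
The three quantities above can be computed directly from class labels. The following Python sketch is our own illustration of the formulas, not rpart code, and the function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini index of a list of class labels: 1 minus the sum of squared frequencies."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(left, right):
    """Size-weighted Gini index of a binary split into two subsets."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def gini_reduction(labels, left, right):
    """Reduction in Gini index obtained by splitting labels into left and right."""
    return gini(labels) - gini_split(left, right)


labels = [0, 0, 1, 1]
print(gini(labels))                            # 0.5
print(gini_reduction(labels, [0, 0], [1, 1]))  # 0.5, a perfectly pure split
print(gini_reduction(labels, [0, 1], [0, 1]))  # 0.0, an uninformative split
```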

3.1.2 Classification with the tree 

In our case, attributes are sometimes numeric, such as fac- 
tors, and sometimes binary, such as cell values in the Q- 
matrix. And the class is binary since it is also a Q-matrix 
cell value. At each point of decision from the root node of 
the tree to a leaf node, a choice is made to go left or right 
based on the splitting point of each node. The nodes in the 
partition trees of this experiment are the output of the tech- 
niques (suggested values) and the factors considered (they 
are described in the next section). 

Once a leaf node is reached, classification is based on the
majority vote of the cases that fall under that leaf node: if
the training set contained more cases labeled ’0’, this is the
proposed value, else it is a ’1’.
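
The traversal and majority vote can be sketched as follows. The tree, its attribute names and its split points are hypothetical and purely illustrative. They are not taken from the trees grown in the experiments.

```python
def leaf_value(training_labels):
    """Majority vote: 0 if strictly more training cases are labeled 0, else 1."""
    zeros = sum(1 for y in training_labels if y == 0)
    return 0 if zeros > len(training_labels) - zeros else 1


def classify(node, record):
    """Walk from the root to a leaf, going left when the attribute value
    is below the node's split point, then apply the majority vote."""
    while "labels" not in node:
        branch = "left" if record[node["attribute"]] < node["split"] else "right"
        node = node[branch]
    return leaf_value(node["labels"])


# Hypothetical two-level tree: internal nodes hold a split, leaves hold
# the class labels of the training cases that reached them.
tree = {
    "attribute": "MinRSS", "split": 0.5,
    "left": {"labels": [0, 0, 0, 1]},
    "right": {
        "attribute": "stickiness_MinRSS", "split": 0.3,
        "left": {"labels": [1, 1, 1, 0]},
        "right": {"labels": [0, 0, 1]},
    },
}
print(classify(tree, {"MinRSS": 1, "stickiness_MinRSS": 0.1}))  # 1
print(classify(tree, {"MinRSS": 1, "stickiness_MinRSS": 0.8}))  # 0
```

In the second record, the technique proposes a 1 but the cell is highly sticky, so the tree overrides the proposal.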

3.2 Factors considered 

The partition tree relies on each technique’s output, the Q- 
matrix refinement proposition, and on a number of factors 
that may provide information about the most reliable tech- 
nique refinement in a given context. The factors considered 
to be relevant are the following: 

• Skills per row. Items can require one or more skills. 
The skills per row indicates the number of skills re- 
quired. 

• Skills per column. The sum of the skills per column
is an indicator of how often this skill is measured by
the different items of the Q-matrix.

• Stickiness. If a technique systematically proposes a 
change to a cell of the Q-matrix, no matter what the 
perturbation is, this is an indication that this particu- 
lar change to the original Q-matrix is an artifact of the 
structure of the Q-matrix and the algorithm. We call 
this property the stickiness of a cell of the matrix and 
it is measured by the proportion of times the value of 
the cell is incorrectly changed over all perturbations. 

Recall that we train the partition tree over synthetic 
data for which the ground truth is known. We can 
therefore reliably identify incorrect changes. This is 
detailed below. 
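
The three factors can be computed as in the sketch below. We assume here that stickiness is the fraction of perturbation runs in which the value proposed for the cell differs from its ground-truth value. The Q-matrix and the function names are hypothetical.

```python
def skills_per_row(Q, j):
    """Number of skills required by item j."""
    return sum(Q[j])


def skills_per_column(Q, k):
    """Number of items that require skill k."""
    return sum(row[k] for row in Q)


def stickiness(original_value, proposed_values):
    """Proportion of perturbation runs in which the cell is incorrectly changed.

    proposed_values holds the value proposed for the cell by one technique,
    one entry per perturbation run (assumed reading of the definition)."""
    if not proposed_values:
        return 0.0
    changed = sum(1 for v in proposed_values if v != original_value)
    return changed / len(proposed_values)


Q = [[1, 0, 0], [1, 1, 0], [1, 0, 1]]   # hypothetical 3 items x 3 skills
print(skills_per_row(Q, 1))             # 2
print(skills_per_column(Q, 0))          # 3
print(stickiness(0, [0, 1, 1, 0]))      # 0.5
```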
3.3 Training of the partition tree

The partition tree is trained on data that contains the fol- 
lowing set of attributes: 

• $original_{(j,k)}$: value of cell (j, k) in the original matrix.
This is the target class of the partition tree and it
corresponds to “Value before” in table 1.

• $MaxDiff_{(j,k)}$, $MinRSS_{(j,k)}$, $ALSC_{(j,k)}$: the three values
proposed as refinements by the respective technique in
place of the original value. For every record, at least
one of these must be different from the original one, or
else it is a perturbated cell record. This corresponds
to “Value proposed” in table 1, one for each refinement
technique.

• $RS_{Q_i,j}$, $CS_{Q_i,k}$: the number of skills per row and per column
attributes (see section 3.2). These factors are per
Q-matrix, $Q_i$, and per row j and column k.

• $SF_{MaxDiff}(Q_i,j,k)$, $SF_{MinRSS}(Q_i,j,k)$, $SF_{ALSC}(Q_i,j,k)$:
the stickiness factors of the cell, one for
each matrix and technique.

The training data is generated through a perturbation pro- 
cess. Each cell of a Q-matrix is perturbated, in turn and 
one at a time, to create a new training record containing 
the above attributes. However, non perturbated cells that 
are left unchanged by all refinements techniques, cases (5) 
and (6) in table 1, are left out of the training data because 
they were assumed to be uninformative. 
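
The perturbation process and the record filter can be sketched as follows. The placeholder Q-matrix and the function names are hypothetical. Each perturbed matrix would then be fed to the three refinement techniques.

```python
def single_cell_perturbations(Q):
    """Yield (j, k, perturbed_Q) for every single-cell flip of Q."""
    for j, row in enumerate(Q):
        for k, value in enumerate(row):
            perturbed = [list(r) for r in Q]   # copy so Q itself is untouched
            perturbed[j][k] = 1 - value        # flip exactly one cell
            yield j, k, perturbed


def is_training_record(cell_is_perturbed, original_value, proposed_values):
    """Keep perturbated cells, and non perturbated cells changed by at least
    one technique; cases (5) and (6) of table 1 are dropped."""
    return cell_is_perturbed or any(v != original_value for v in proposed_values)


# Placeholder 11 items x 3 skills matrix (not one of the Q-matrices of the study).
Q = [[1, 0, 0] for _ in range(11)]
print(len(list(single_cell_perturbations(Q))))   # 33
print(is_training_record(False, 0, [0, 0, 0]))   # False
print(is_training_record(False, 0, [0, 1, 0]))   # True
```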

The data set used to train the partition trees is very
large. For the permutations of a single Q-matrix,
the number of perturbated and non perturbated cells ranges
from approximately 50,000 to 250,000.


Training of the partition tree for expert Q-matrices with 
synthetic data. Whereas for synthetic data, we can gen- 
erate a large array of Q-matrices and ample training and 
testing data, real data poses a challenge in that respect. 
Typically, for a single data set, we have only a few ex- 
pert Q-matrices, and often a single one is available. For a 
3 skills x 11 items matrix, only 33 single perturbations are 
possible to train a partition tree. Furthermore, and unlike
synthetic data, we do not know which refinements of the
Q-matrix are valid. A “sticky” cell might be a valid
refinement, and so can some of the perturbations that are
presumed incorrect.

To get around these issues, the training of the partition tree
is conducted over synthetic data where the ground truth is
known and where we can use a large span of matrices similar
to the expert one. Similarity to the Q-matrix to refine is
achieved by random permutations of the cells of the original
Q-matrix. For each Q-matrix, a total of 1000 Q-matrices
are generated through this permutation process. Item outcome
data for 400 simulated students is also generated. The
sim.din function of the R package CDM [15] is used for
generating synthetic student item outcome data, using 0.2
slip and guess factors.
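
The generation procedure can be approximated by the Python sketch below. It is not the sim.din function. We assume that permuting means shuffling all cells of the matrix, and that mastery profiles are drawn uniformly at random. Both choices, and all names, are our own assumptions.

```python
import random


def permute_cells(Q, rng):
    """Randomly permute all cells of Q, keeping its shape and number of 1s."""
    n_cols = len(Q[0])
    cells = [v for row in Q for v in row]
    rng.shuffle(cells)
    return [cells[i:i + n_cols] for i in range(0, len(cells), n_cols)]


def simulate_dina(Q, n_students, slip, guess, rng):
    """Return (profiles, responses), with responses[i][j] for student i, item j."""
    n_skills = len(Q[0])
    profiles = [[rng.randint(0, 1) for _ in range(n_skills)]
                for _ in range(n_students)]
    responses = []
    for alpha in profiles:
        row = []
        for q in Q:
            masters_all = all(a >= r for a, r in zip(alpha, q))
            p_success = 1 - slip if masters_all else guess
            row.append(int(rng.random() < p_success))
        responses.append(row)
    return profiles, responses


rng = random.Random(0)                      # fixed seed for reproducibility
Q = [[1, 0, 0], [1, 1, 0], [0, 0, 1]]       # hypothetical 3 items x 3 skills
Q_perm = permute_cells(Q, rng)
profiles, responses = simulate_dina(Q_perm, 400, 0.2, 0.2, rng)
print(len(responses), len(responses[0]))    # 400 3
```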

4. REAL DATA AND Q-MATRICES 

The primary source of real data for our study, from which the
synthetic data is also mimicked, is the well known data set
on fraction algebra problems from Tatsuoka [17] (see table 1
in [4] for a description of the problems and of the skills).
The data contains complete answers of 536 students to 20
question items, but only a subset of 11 items is used by the
Q-matrices in the current study. It corresponds to the set of
items common to the different Q-matrices of the experiment.

Table 2: Four Q-matrices over 11 items of Tatsuoka’s
data set on student item outcome

            Number of
            skills   items   cases   Description

  QM 1      3        11      536     Expert driven. Skill 1 shared by all items. From [9]
  QM 2      5        11      536     Expert driven. From [3]
  QM 3      3        11      536     Expert driven. Single skill per item. [15]
  QM 4      3        11      536     Data driven, SVD-based.

The original Q-matrix of this data set contains 8 skills and, 
as mentioned, 20 items. However, a number of variations of 
this matrix have been proposed and studied with a smaller 
number of skills and items [9, 3, 15]. We also chose to focus 
on this smaller skills set since they offer three very differ- 
ent expert-defined Q-matrices over the same set of items. 
Moreover, a smaller set of skills allows us to better establish 
the validity of the approach on a simpler problem, leaving 
for later the demonstration of whether it scales correctly to 
larger sets. The Q-matrices are described below. 

Four Q-matrices are considered. Three of them have been 
studied in the literature and one is defined by ourselves. 
Their main attributes are reported in table 2 and the actual 
Q-matrices are shown in figure 1 (except for QM 1 which is 
introduced in section 2). 



As mentioned, all Q-matrices are derived from the Tatsuoka [17]
20 item set. QM-1, QM-2 and QM-3 are available
from the CDM package. All Q-matrices have 3 skills, except
for QM-2 which has 5 skills. QM-3 is the only one
with a single skill per item. Matrix QM 4 was created for the
purpose of this study, using the three largest singular values
and the items to skills V matrix of the SVD decomposition
of the Tatsuoka data mentioned above.

Therefore, while these four Q-matrices all share the same 
11 items, they vary by the number of skills, item monoticity 
or not, whether a skill is common to all items, and whether 
they are driven from data or driven from expert analysis of 
item skills involved. 

5. GENERAL PROCEDURE SUMMARY 
AND METHODOLOGICAL NOTES 

To ease the understanding of the general process of the ex- 
periments, and at the expense of introducing some redun- 
dancy, figure 2 summarizes the main steps and dependencies. 
The top greyed box illustrates the process to generate the 
data for partition trees training, and the synthetic data for 
performance evaluation. The bottom greyed box illustrates 
the two test procedures for real and synthetic data. We 
explain the figure below and fill in some details as well. 


Data generation. For each of the four Q-matrices (QM_i),
the data generation process takes the Q-matrix (1) and
generates 1000 permutations (2). Duplicates are kept if any.
For each permutation, synthetic test
outcome data of 400 simulated students is created with the
CDM utility sim.din (3). Finally, each QM is perturbated,
and that Q-matrix is fed to each of the three techniques
to generate training data for the partition tree described in
section 3.3 (4).

Test over real data. The experiment to assess the performance
over real data takes three sources of input: the
Q-matrices (1), the fraction algebra data set of Tatsuoka as
described in section 4 (6), and finally a partition tree (5) trained
from the data generated in (4). It outputs a set of refinements
from the different partition trees and for each of the three
techniques as well (7). Finally, the refinements are compared
with the original Q-matrices in (1).

Test over synthetic data. For assessing the performance 
over synthetic data (9), the process is similar, with the main 
difference that refinements are based on the synthetic test 
outcome data generated in (3) instead of real data. And the 
comparison is not done over the Q-matrices in (1), but in- 
stead over the permuted Q-matrices in (2), which constitute 
the ground truth as they are used to generate the data. 

5.1 Data set size, cross-validation, and the 
assumption of correctness of expert 
Q-matrices 

As shown in figure 2, synthetic test outcome data (3) is used
for both the training of the partition trees and testing over
synthetic data. This large data set (see sect. 3.3) leaves little
room for overfitting of the partition trees, and therefore the
cross-validations bring very small differences in performance:
accuracy/RSS error reduction is the same between a
cross-validated and a non cross-validated performance assessment
at the 0.01 level reported in the results below.

Figure 2: General validation procedure for each
Q-matrix (QM_i). See section 5 for details.

However, for real data, the size of the testing data set is 
much smaller. It varies between 366 (QM-2) and 561 (QM- 
3), because the test data is based solely on the permutations 
of the four Q-matrices. But because the test procedure uses 
partition trees trained from synthetic data, there are no bias 
issues and cross validation is not required here. 

Note also that, for real data, the expert-defined Q-matrix is
not necessarily consistent with the (unknown) ground truth.
Nevertheless, we consider this Q-matrix as valid and the
evaluation of the proposed refinements is made by comparing
refinements with expert-defined Q-matrices, as though
they were the ground truth. We should keep in mind that
the performance score may be negatively biased if this
assumption is false, but for the purpose of comparing the
relative performance of the techniques among themselves, and if
we assume that all techniques are equally affected by this
bias, it makes no difference to our relative results.

Table 3: Results for synthetic data

           Technique                     Partition tree
  QM       MinRSS   MaxDiff   ALSC       (1)     (2)     (3)

  Accuracy of perturbated cells
  1        0.81     0.47      0.82       0.81    0.88    0.95
  2        0.07     0.26      0.36       0.52    0.53    0.83
  3        0.96     0.49      0.95       0.99    1.00    1.00
  4        0.90     0.49      0.85       0.90    0.92    0.96
  X        0.69     0.43      0.75       0.81    0.83    0.93

  Accuracy of non perturbated cells
  1        0.97     0.56      0.44       0.97    0.91    0.99
  2        0.99     0.53      0.50       0.99    0.99    0.99
  3        0.95     0.26      0.74       0.95    0.94    0.99
  4        0.97     0.56      0.44       0.97    0.97    1.00
  X        0.97     0.48      0.53       0.97    0.95    0.99

  F-score
  1        0.88     0.51      0.58       0.88    0.90    0.97
  2        0.13     0.35      0.42       0.68    0.69    0.90
  3        0.96     0.34      0.83       0.97    0.97    1.00
  4        0.93     0.52      0.58       0.93    0.94    0.98
  X        0.72     0.43      0.60       0.87    0.87    0.96

Table 4: Results for real data

           Technique                     Partition tree
  QM       MinRSS   MaxDiff   ALSC       (1)     (2)     (3)

  Accuracy of perturbated cells
  1        0.39     0.17      0.52       0.39    0.36    0.67
  2        0.35     0.09      0.56       0.60    0.62    0.64
  3        0.27     0.09      0.36       0.61    1.00    0.88
  4        0.42     0.11      0.58       0.42    0.48    0.61
  X        0.36     0.12      0.51       0.51    0.62    0.70

  Accuracy of non perturbated cells
  1        0.45     0.68      0.56       0.45    0.38    0.60
  2        0.93     0.93      0.28       0.94    0.94    0.97
  3        0.64     0.83      0.42       0.69    0.76    0.78
  4        0.55     0.89      0.32       0.55    0.52    0.51
  X        0.52     0.68      0.32       0.62    0.62    0.68

  F-score
  1        0.42     0.27      0.54       0.42    0.37    0.63
  2        0.50     0.17      0.37       0.73    0.74    0.77
  3        0.38     0.16      0.39       0.64    0.86    0.83
  4        0.48     0.20      0.42       0.48    0.50    0.56
  X        0.45     0.20      0.43       0.57    0.62    0.70

6. PERFORMANCE MEASURES 

To measure the performance of the proposed refinements, 
we use the difference between the original Q-matrix and the 
proposed refinement of a technique. We use the classification 
of correct and incorrect refinements introduced in table 1. 
Cells that are neither perturbated nor incorrectly suggested
as refinements by any of the techniques are ignored in the
analysis (the true negatives of table 1, TN). This is the case
of the large majority of cells, and it is also consistent with the
training of the partition tree, for which they are also filtered out.

Recovery of a perturbated cell to its original value can be
considered as a recall measure, whereas the non perturbated
cells that are left unchanged can be considered as a precision
measure. In that respect, we define a performance measure
that combines precision and recall of the refinement technique
into a single F-score measure:

$$\text{F-score} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} = 2 \times \frac{Acc_{\neg p} \times Acc_{p}}{Acc_{\neg p} + Acc_{p}}$$

where $Acc_{p}$ and $Acc_{\neg p}$ are respectively the accuracy
measures of the proposed refinements for the perturbated and
non perturbated cells. This measure gives equal weight to
both types of accuracies and avoids a bias in favour of the
accuracy of the non perturbated cells, which can considerably
outweigh in number the single perturbated cell, even after
filtering out non-perturbated cells that are left unchanged.
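
As a quick numerical check of this measure, the sketch below computes the F-score from the two accuracies. The function name is ours, and the input values are the MinRSS accuracies for QM 1 on synthetic data (0.81 and 0.97).

```python
def f_score(acc_perturbed, acc_non_perturbed):
    """Harmonic mean of the accuracies over perturbated and non perturbated cells."""
    total = acc_perturbed + acc_non_perturbed
    if total == 0:
        return 0.0  # both accuracies are zero; avoid division by zero
    return 2 * acc_perturbed * acc_non_perturbed / total


print(round(f_score(0.81, 0.97), 2))  # 0.88
```

Because the harmonic mean is dominated by the smaller of the two values, a technique that never proposes a change obtains an F-score of 0 despite a perfect accuracy on non perturbated cells.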

7. RESULTS 

The results are reported in tables 3 and 4. The format of
these tables is first described below.

7.1 Description 

The respective results of the four Q-matrices (column QM) 
in table 2 are reported. They correspond to a single run 
(real data can vary a few percentage points by run, but it is 
practically stable for synthetic data due to the large number 
of cases). The accuracy of refinement for perturbated and 
non perturbated cells are reported separately, followed by 
the F-score which combines both types of accuracy. The
averages over the four matrices for each of these three
performance measures are also reported as X.

The accuracy and F-score of each individual technique are
reported under columns MinRSS, MaxDiff, and ALSC.

The three columns under Partition tree correspond to the 
performance as a function of different factors used for build- 
ing the tree: 

(1) MinRSS + MaxDiff + ALSC. Only the output of 
the three refinement techniques is considered. 

(2) MinRSS + MaxDiff + ALSC + SR + SC. The
number of skills per row (SR) and skills per column
(SC) of the target cell are taken into account in addition
to the output of each technique. If some technique
performs better under some combination of SR
and SC, this tree will be able to take these factors into
account.

(3) MinRSS + MaxDiff + ALSC + SR + SC +
Stickiness.MinRSS + Stickiness.MaxDiff +
Stickiness.ALSC. The tendency of a cell to be a false
positive for the MinRSS and ALSC methods is added.
The Stickiness factor with MaxDiff is omitted here because
it did not yield improvements.

7.2 Synthetic data 

The results for synthetic data in table 3 show large differences
between the different matrices and across the individual
techniques.

The MinRSS method is clearly superior in terms of general
accuracy, except for the 5-skills Q-matrix where it can
only identify the perturbated cell 7% of the time, which
brings its average below that of the ALSC technique. However,
because it introduces fewer false positives (incorrect
refinements) than other techniques, it outperforms the other two
methods on the F-score.

On average, the ALSC technique is good at identifying the
perturbated cell, with a 75% average accuracy, but it also tends
to introduce more false positives and consequently obtains a
lower global F-score than MinRSS.

Another noticeable result is that the results for QM 3 are
very good, in particular for the partition trees which have
perfect performance (rounding at the second decimal). This
is likely attributable to the fact that it defines a single-skill
mapping.

Turning to the main questions addressed in this study, the
results of partition tree (1), which uses only the three
techniques’ output, are equal to or better on all scores than those
of any individual technique. This confirms the initial hypothesis
for synthetic data. Furthermore, the inclusion of factors (partition
trees (2) and (3)) also substantially improves all scores,
confirming the other hypothesis that some techniques perform
better under a combination of factors and that the partition
tree is effectively able to take advantage of this information.
The stickiness factor is by far the most effective.

7.3 Real data 

The results over the real data reported in table 4 show the
same trends as the synthetic data, but bring less pronounced
improvements. They also support both hypotheses.

We do find an exception with the non perturbated cells 
where the MaxDiff accuracy is above the partition trees (1) 
and (2) and close to (3). This is mainly due to the fact 
that more “false positives” are generated by the MinRSS 
and ALSC techniques for real data than for synthetic data, 
whereas the MaxDiff technique outputs very few changes 
in both contexts. That observation is consistent with the 
results in [6]. 

The balance between true positives and true negatives
illustrates why the F-score should be the reference: a perfect
score could be obtained over the accuracy of non perturbated
cells if no change is ever suggested, but that would make
such a refinement technique useless.

Therefore, turning to the F-scores, the tendencies are highly
consistent with the synthetic data. The F-score of the best
performer, 0.41 of MinRSS, is improved to 0.55 with the
combination of the three techniques, and to 0.66 when all
factors are included in the partition tree.

8. DISCUSSION 

The results of the above experiments show that the combi- 
nation of Q-matrix refinement techniques using a partition 
tree can bring substantial improvements over the best performance
of the individual techniques. For synthetic data
the average best F-score of the MinRSS technique, 0.72, is
improved to 0.96, and for real data it is raised from 0.41
to 0.66. These results represent an 86% and a 55% error
reduction for the F-score of the synthetic and the real data
respectively (error reduction = 1 - (1 - F')/(1 - F), where
F is the initial F-score and F' is the improved F-score).

In practical terms, if the best technique finds an error in
a Q-matrix 5 out of 10 times, an error reduction of 40%
represents an increase from 5 to 7 out of 10 times, and
the same ratio applies to the reduction of false errors. And these
figures rest on the assumption that we would know which
technique is the best, whereas according to the results of table 4
the best technique varies across Q-matrices.
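
The error reduction formula given above can be checked numerically with a short sketch. The function name is ours. The two calls use the synthetic data F-scores (0.72 and 0.96) and the 5 to 7 out of 10 example.

```python
def error_reduction(f_initial, f_improved):
    """Relative reduction of the error 1 - F when F goes from f_initial to f_improved."""
    return 1 - (1 - f_improved) / (1 - f_initial)


print(round(error_reduction(0.72, 0.96), 2))  # 0.86
print(round(error_reduction(0.5, 0.7), 2))    # 0.4
```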

Another positive note on the results is that the partition tree
F-scores are more stable across Q-matrices and are systematically
better than those of any individual technique when all factors
are taken into account (partition tree 3). This regularity implies
that, at least in the space of Q-matrices surveyed, one
can safely choose partition tree refinements without concern
that another technique could deliver better
refinements for a specific Q-matrix.

In spite of these encouraging results, limitations and issues 
remain. 

One limitation is that the results are from a single 11-item set,
and from a single domain. We can reasonably believe that
the results vary across contexts and more investigation is
required to assess this variability.

Another limitation is that the models investigated in the current
study use static student data: they assume that skill mastery
does not change for a single student. This assumption
is false for most data gathered in learning environments,
where students take on exercises as they learn and are being
assessed throughout the learning process. This type of data
can be labeled as dynamic item outcome data because a student
will be in different states of skills mastery as learning
occurs.

In order to effectively use the existing techniques of Q-matrix
refinement, we would need to be able to detect the moment
when the state of skill mastery changed. Failure to do so
would create noise in the data and impair the effectiveness
of these techniques. Fortunately, substantial progress has
been made in the recent decade or two towards detecting
the moment of learning, such as the large body of work
on Bayesian Knowledge Tracing and Tensor factorization
(e.g. [1, 18]). We can also cite the work of [14], who
refer to a time-varying skills matrix for students and test
their approach on synthetic data. But apart from this recent
contribution, little work has been done on using this type of
data for refining a Q-matrix, and we can only expect existing
techniques to underperform with dynamic student data.